Abstract

We consider the problem of PAC-learning decision trees, i.e., learning a decision tree over the 
n-dimensional hypercube from independent random labeled examples. Despite significant effort, 
no polynomial-time algorithm is known for learning polynomial-sized decision trees (even trees 
of any super-constant size), even when examples are assumed to be drawn from the uniform 
distribution on $\{0,1\}^n$. We give an algorithm that learns arbitrary polynomial-sized decision
trees for most product distributions. In particular, consider a random product distribution where 
the bias of each bit is chosen independently and uniformly from, say, [.49, .51]. Then with high 
probability over the parameters of the product distribution and the random examples drawn 
from it, the algorithm will learn any tree. More generally, in the spirit of smoothed analysis, we 
consider an arbitrary product distribution whose parameters are specified only up to a $[-c, c]$
accuracy (perturbation), for an arbitrarily small positive constant $c$.

1 Introduction 

Decision trees are classifiers at the center stage of both the theory and practice of machine 
learning. Despite decades of research, no polynomial-time algorithm is known for PAC-learning 
polynomial-sized (or any super-constant-sized) Boolean decision trees over $\{0,1\}^n$, even
assuming examples are drawn from the uniform distribution on inputs. The situation is no better
for any other constant-bounded product distribution. In light of this, what we show is perhaps 
surprising: every decision tree can be learned from most product distributions. Hence, the 
uniform-distribution assumption common in learning (and other fields) may not be simplifying 
matters as one might hope. 



1.1 Related work 

Learning decision trees in Valiant's PAC model [13] requires learning an arbitrary tree from
polynomially many random labeled examples, drawn independently from an arbitrary distribution
and labeled according to the tree. Note that the output of the learning algorithm need
not be a decision tree: any function that approximates the target tree well on future examples
drawn from the same distribution as the training data suffices. The uniform-PAC model of
learning assumes that data is drawn from the uniform distribution. In previous work, size-$s$ trees
were shown to be PAC-learnable in time $O(n^{\log s})$ (HQ]. Juntas, functions that depend on only
$r$ "relevant" bits (a special case of decision trees of size $2^r$), can be uniform-PAC learned faster:
in time roughly $O(n^{0.7r})$ [10]. A variety of alternatives to PAC learning have been considered
to circumvent these difficulties. Random depth-$O(\log n)$ trees have been shown to be properly
learnable, with high probability, from uniform random examples by Jackson and Servedio [7].
Decision trees have also been shown to be learnable from data that comes from a random
walk, i.e., consecutive training examples differ in a single random position [2]. A seminal result
of Kushilevitz and Mansour (KM) [5j, using an algorithm similar to Goldreich-Levin [?j, shows
that decision trees are uniform-PAC learnable from membership queries (i.e., black-box access
to the function) in polynomial time. Since KM proved to be an essential ingredient in further
work such as learning DNFs [6] and agnostic learning [5], as well as in applications beyond
learning, the present work gives hope for a number of questions discussed in Section 6.

*This work was done while the author was visiting Microsoft Research New England.

We consider a "smoothed learning" model inspired by Smoothed Analysis, which Spielman 
and Teng introduced to explain why the simplex method for linear programming (LP) usually 
runs in polynomial time [12]. Roughly speaking, they show that if each parameter of an LP is
perturbed by a small amount, then the simplex method will run in polynomial time with high 
probability (in fact, the expected run-time will be polynomial). For LP's arising from nature 
or business (as opposed to reduction from another computational problem), the parameters are 
measurements or estimates that have some inherent inaccuracy or uncertainty. Hence, the model 
is reasonable for a large class of interesting LP's. 

1.2 Main result 

We suppose that the examples are coming from a product distribution $\mathcal{P}_\mu$ specified by
$\mu \in [0,1]^n$, where $\mu_i = \mathbb{E}_{x \sim \mathcal{P}_\mu}[x_i]$. An illustrative instantiation of our main result
is the following. Take any decision tree and pick a random $\mu \in [0.49, 0.51]^n$. Then, with high
probability (over $\mu$ and the random examples from $\mathcal{P}_\mu$), our algorithm will output a polynomial
threshold function which is a good approximation to the tree. Since $\mathcal{P}_{(.5,\ldots,.5)}$ is the uniform
distribution, the choice of $\mu \in [0.49, 0.51]^n$ is close in spirit to the uniform distribution.

More generally, fix any arbitrarily small constant $c \in (0, 1/4)$. An adversary, if you will,
chooses an arbitrary decision tree $f$ and an arbitrary $\bar\mu \in [2c, 1-2c]^n$, but the actual product
distribution will have parameters $\mu = \bar\mu + \Delta$, where $\Delta \in [-c, c]^n$ is a uniformly random
perturbation. Then, a polynomial number of examples will be drawn from $\mathcal{P}_\mu$. With high probability
over the perturbation $\Delta$ and the data drawn from $\mathcal{P}_{\bar\mu+\Delta}$, the algorithm will output a function
which is very close to $f$. The main theorem we prove is the following.

Theorem 1. Let $c \in (0, 1/4)$. Then there is a univariate polynomial $q$ such that, for any
integers $n, s \ge 1$, reals $\epsilon, \delta > 0$, function $f : \{0,1\}^n \to \{-1,1\}$ computed by a size-$s$ decision
tree, and any $\bar\mu \in [2c, 1-2c]^n$, with probability $\ge 1 - \delta$ over $\Delta$ chosen uniformly at random from
$[-c, c]^n$ and $m \ge q(ns/(\delta\epsilon))$ training examples $(x^1, f(x^1)), \ldots, (x^m, f(x^m))$, where each $x^j$ is
drawn independently from $\mathcal{P}_\mu$ (where $\mu = \bar\mu + \Delta$), the output of algorithm L is $h$ with,

\[ \Pr_{x \sim \mathcal{P}_\mu}[h(x) \ne f(x)] \le \epsilon. \]

Algorithm L is polynomial time, i.e., it runs in time poly$(n, m)$ and outputs a polynomial
threshold function.

It is worth making a few remarks about this theorem. Worst-case analysis is beautiful 
but sometimes leads to artificial limitations, especially in domains like learning where we do 
not actually believe that an adversary chooses the problem. In this sense, it is natural to
slightly weaken the power of the adversary. Here, we have assumed that the adversary can only
specify the product distribution up to $[-c, c]$ accuracy, or rather that the adversary may have a
trembling hand (to misuse a term of Selten). As an example of smoothed analysis, ours is
interesting because, unlike linear programming, where worst-case polynomial-time alternatives
to the simplex method were already known, there are no known efficient algorithms for uniform-PAC
learning decision trees.

The output of their algorithm is a decision-tree classifier.

Statistically speaking, this distribution is quite different from the uniform distribution. Learning from any
$\mu \in [1/2 - \sqrt{1/n}, 1/2 + \sqrt{1/n}]^n$ would likely be as difficult as learning from the uniform distribution.

In learning, the standard uniform-PAC model already "assumes away" any adversarial con- 
nection between the function being learned and the distribution over data. Now, the uniform 
distribution assumption is made with the hope that the resulting algorithms may be useful for 
learning or at least shed light on issues involved in the problem; it is a natural first step in de- 
signing general-distribution learning assumptions. We hope that the smoothed analysis serves 
a similar purpose. 

1.3 The approach 

The intuition behind our algorithm is quite simple. It will turn out to be notationally convenient
to consider examples $x \in \{-1,1\}^n$. Now for starters, consider a decision tree that computes a
$\log(n)$-sized parity $f(x) = \prod_{i \in S} x_i$, for some set $S \subseteq \{1, 2, \ldots, n\}$, $|S| = \log_2(n)$. This can be done
using a size-$n$ tree. Under the uniform distribution on examples, each bit $x_i$ (or any subset of
$\le \log(n) - 1$ bits) is uncorrelated with $f$. Now take a product distribution with random mean
vector $\mu \in [-c, c]^n$ and define $x' = x - \mu$, so that $\mathbb{E}[x'_i] = 0$. Then with probability $\ge 1 - \delta$, $f(x)$
has a significant (poly$(\delta/n)$) correlation with each $x'_i$ for $i \in S$ and no correlation with any $i \notin S$.
Hence, it is easy to find the relevant bits. Now, a polynomial-size tree may, in general, involve
all $n$ bits, so finding the relevant bits is not sufficient.
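The parity example can be checked by exact enumeration on a tiny instance. The following is a minimal sketch, not part of the paper: the helper names (`product_prob`, `correlation_with_centered_bit`), the choice of four bits, and the specific means are illustrative assumptions. It computes $\mathbb{E}[f(x)(x_i - \mu_i)]$ for a two-bit parity under the uniform distribution and under a slightly perturbed product distribution.

```python
import itertools
import math

def product_prob(x, mu):
    """Probability of x in {-1,1}^n under the product distribution with means mu."""
    return math.prod((1 + xi * mi) / 2 for xi, mi in zip(x, mu))

def correlation_with_centered_bit(f, mu, i):
    """Exact E[f(x) * (x_i - mu_i)] under the product distribution with means mu."""
    n = len(mu)
    return sum(product_prob(x, mu) * f(x) * (x[i] - mu[i])
               for x in itertools.product((-1, 1), repeat=n))

# Hypothetical small instance: parity on S = {0, 1} among n = 4 bits.
S = (0, 1)

def parity(x):
    return math.prod(x[i] for i in S)

uniform = [0.0, 0.0, 0.0, 0.0]
perturbed = [0.05, -0.03, 0.02, 0.04]  # illustrative perturbed means

corr_uniform = [correlation_with_centered_bit(parity, uniform, i) for i in range(4)]
corr_perturbed = [correlation_with_centered_bit(parity, perturbed, i) for i in range(4)]
```

Under the uniform means every correlation vanishes, while under the perturbed means the bits in $S$ pick up a correlation of order $\mu_j$ (here $\mu_1(1-\mu_0^2)$ for bit 0) and the bits outside $S$ remain uncorrelated.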

As is standard for Fourier learning under product distributions, one can write $f(x) = f(x' + \mu)$
as a polynomial in $x'$. Each coefficient of a term $\prod_{i \in S} x'_i$ can be estimated in a straightforward
manner from random examples. However, finding the heavy coefficients (those with large
magnitude) is a bit like finding a number of needles in a haystack. Yet this is the most
fascinating aspect of the problem: it requires so-called feature discovery or feature construction
algorithms. These algorithms hence tie together a fundamental problem in both the theory and
practice of learning: many claim that the heart of the problem of machine learning is really that
of finding or creating good features [9].

The key property we prove is the following, with high probability over $\mu \in [-c, c]^n$. If the
coefficient in $f(x')$ of a term $\prod_{i \in T} x'_i$ is large, then so is the coefficient of $\prod_{i \in S} x'_i$ for each $S \subseteq T$.
This makes finding all the large coefficients easy using a top-down approach. The proof of this
fact relies on two properties: there is a simple relationship between different coefficients under
different product distributions, and a low-degree nonzero multilinear polynomial cannot be too
close to 0 too often (this is a continuous generalization of the Schwartz-Zippel theorem). In our
simple example, it is easy to see that by expanding $f(x) = \prod_{i \in S} x_i = \prod_{i \in S} (x'_i + \mu_i)$, all coefficients
of terms $\prod_{i \in T} x'_i$ for $T \subseteq S$ will be nonzero with probability 1.

Another perspective on the algorithm is that it gives a substitute for KM (equivalently 
Goldreich-Levin) using random examples instead of adaptive queries. It is a weaker substitute 
in that it is only capable of finding large coefficients on terms of degree $O(\log n)$.

2 Organization 

Preliminaries are given in Section 3. Before we give the smoothed algorithm for learning, we
prove a property about Fourier coefficients under random product distributions in Section 4. We
then give the algorithm and analysis in Section 5. Conclusions and future work are discussed in
Section 6.
3 Preliminaries

Let $N = \{1, 2, \ldots, n\}$. As mentioned, for notational ease we consider examples $(x, y)$ with
$x \in \{-1,1\}^n$ and $y \in \{-1,1\}$. For $S \subseteq N$, $x \in \mathbb{R}^n$, let $x_S$ denote $\prod_{i \in S} x_i$. Any function
$f : \{-1,1\}^n \to \mathbb{R}$ can be written uniquely as a multilinear polynomial in $x$,

\[ f(x) = \sum_{S \subseteq N} \hat f(S)\, x_S. \]

The $\hat f(S)$'s are called the Fourier coefficients. The degree of a multilinear polynomial is
$\deg(f) = \max\{|S| : \hat f(S) \ne 0\}$, and with a slight abuse of terminology, we say a polynomial is
degree-$d$ if $\deg(f) \le d$.

Henceforth we write $\sum_S$ to denote $\sum_{S \subseteq N}$ and $\sum_{|S|=d}$ to denote the sum over $S \subseteq N$ such that
$|S| = d$. Similarly for $\sum_{|S|>d}$, and so forth. We write $x \in_u A$ to denote $x$ chosen uniformly at
random from set $A$. One may define an inner product between functions $f, g : \{-1,1\}^n \to \mathbb{R}$
by $\langle f, g \rangle = \mathbb{E}_{x \in_u \{-1,1\}^n}[f(x) g(x)]$. It is easy to see that $\langle x_S, x_T \rangle$ is 1 if $S = T$ and 0 otherwise.
Hence, the $2^n$ different $x_S$'s form an orthonormal basis for the set of real-valued functions on
$\{-1,1\}^n$. We thus have that $\langle f, g \rangle = \sum_{S \subseteq N} \hat f(S) \hat g(S)$, and Parseval's equality,

\[ \langle f, f \rangle = \sum_{S \subseteq N} \hat f^2(S) = \mathbb{E}_{x \in_u \{-1,1\}^n}[f^2(x)]. \]

This implies that for any $f : \{-1,1\}^n \to [-1,1]$, $\sum_S \hat f^2(S) \le 1$. It is also useful for bounding
$\mathbb{E}[(f(x) - g(x))^2] = \sum_S (\hat f(S) - \hat g(S))^2$.

A product distribution $\mathcal{D}_\mu$ over $\{-1,1\}^n$ is parameterized by its mean vector $\mu \in [-1,1]^n$,
where $\mu_i = \mathbb{E}_{x \sim \mathcal{D}_\mu}[x_i]$ and the bits are independent. (We now use $\mathcal{D}$ to avoid confusion with
product distributions $\mathcal{P}$ over $\{0,1\}^n$ discussed in the introduction.) The uniform distribution is
$\mathcal{D}_0$. We say $\mathcal{D}_\mu$ is $c$-bounded if $\mu_i \in [-1+c, 1-c]$ for all $i$. Fix any constant $c \in (0, 1/2)$. We assume
we have some fixed $2c$-bounded product distribution with $\bar\mu \in [-1+2c, 1-2c]^n$ and that a random
perturbation $\Delta \in [-c, c]^n$ is chosen uniformly at random, so that the resulting product distribution
has $\mu = \bar\mu + \Delta$. Note that $\mathcal{D}_\mu$ is $c$-bounded; it is called the perturbed product distribution.

For any distribution $\mathcal{D}$ on $\{-1,1\}^n$, one can similarly define an inner product
$\langle f, g \rangle_{\mathcal{D}} = \mathbb{E}_{x \sim \mathcal{D}}[f(x) g(x)]$. In the case of a product distribution $\mathcal{D}_\mu$, it is natural to normalize
the coordinates so that they have mean 0 and variance 1. Let $z(x, \mu) \in \mathbb{R}^n$ be the vector defined
by $z_i(x, \mu) = (x_i - \mu_i)/\sqrt{1 - \mu_i^2}$. When $\mu$ and $x$ are understood from context, we write just
$z$. This normalization gives $\mathbb{E}_{x \sim \mathcal{D}_\mu}[z_i(x, \mu)] = 0$ and $\mathbb{E}_{x \sim \mathcal{D}_\mu}[z_i^2(x, \mu)] = 1$. Let
$z_S = z_S(x, \mu) = \prod_{i \in S} z_i(x, \mu)$. It is also easy to see that $\mathbb{E}_{x \sim \mathcal{D}_\mu}[z_S z_T]$ is 1 if $S = T$ and 0 otherwise.
Hence, the $2^n$ different $z_S$'s form an orthonormal basis for the set of real-valued functions on
$\{-1,1\}^n$ with respect to $\langle \cdot, \cdot \rangle_{\mathcal{D}_\mu}$. We define the normalized Fourier coefficient, for any $S \subseteq N$,

\[ \hat f(S, \mu) = \mathbb{E}_{x \sim \mathcal{D}_\mu}[f(x)\, z_S(x, \mu)]. \qquad (1) \]

Note that this gives a straightforward means of estimating any such coefficient. Also observe
that $\hat f(S, 0) = \hat f(S)$ and that, for any $\mu \in (-1, 1)^n$,

\[ f(x) = \sum_S \hat f(S, \mu)\, z_S(x, \mu). \]
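Equation (1) and the expansion of $f$ in the $z_S$ basis can be verified exactly for a small function. The sketch below is illustrative only: the toy function `f`, the means `mu`, and the helper names are assumptions, not taken from the paper. It computes every normalized coefficient by enumeration and then checks the reconstruction and Parseval's equality under $\mathcal{D}_\mu$.

```python
import itertools
import math

def prob(x, mu):
    """Probability of x under the product distribution D_mu on {-1,1}^n."""
    return math.prod((1 + xi * mi) / 2 for xi, mi in zip(x, mu))

def z_S(x, mu, S):
    """z_S(x, mu) = prod_{i in S} (x_i - mu_i) / sqrt(1 - mu_i^2)."""
    return math.prod((x[i] - mu[i]) / math.sqrt(1 - mu[i] ** 2) for i in S)

def normalized_coefficients(f, mu):
    """Exact coefficients E_{x ~ D_mu}[f(x) z_S(x, mu)] for every subset S."""
    n = len(mu)
    cube = list(itertools.product((-1, 1), repeat=n))
    subsets = [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]
    return {S: sum(prob(x, mu) * f(x) * z_S(x, mu, S) for x in cube) for S in subsets}

def f(x):
    # Toy +/-1-valued function on three bits (an arbitrary example).
    return 1 if (x[0] == 1 and x[1] == 1) or x[2] == 1 else -1

mu = [0.2, -0.1, 0.3]
coef = normalized_coefficients(f, mu)

# Reconstruction error of f(x) = sum_S coef[S] * z_S(x, mu) over the whole cube.
recon_err = max(abs(f(x) - sum(c * z_S(x, mu, S) for S, c in coef.items()))
                for x in itertools.product((-1, 1), repeat=3))
# Parseval under D_mu: the squared coefficients sum to E[f^2] = 1.
parseval = sum(c * c for c in coef.values())
```

In practice the expectation in (1) would be replaced by an empirical average over the sample, which is how the algorithm in Section 5 uses it.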

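Finally, it will be convenient to define a partially normalized Fourier coefficient,

\[ \tilde f(S, \mu) = \frac{\hat f(S, \mu)}{\prod_{i \in S} \sqrt{1 - \mu_i^2}}. \]

Note that if $\mu \in [-1+c, 1-c]^n$ then, since $1 - \mu_i^2 \ge 1 - (1-c)^2 \ge c$, we have,

\[ |\hat f(S, \mu)| \le |\tilde f(S, \mu)| \le c^{-|S|/2}\, |\hat f(S, \mu)|. \qquad (2) \]

In this notation, we also have,

\[ f(x) = \sum_S \tilde f(S, \mu) \prod_{i \in S} (x_i - \mu_i) = \sum_S \tilde f(S, \mu)\, (x - \mu)_S. \]

Hence, for any $\mu = \bar\mu + \Delta$,

\[ \sum_S \tilde f(S, \mu)\, (x - \mu)_S = \sum_S \tilde f(S, \bar\mu)\, ((x - \mu) + \Delta)_S. \]

Collecting terms gives a means for translating between product distributions $\mu = \bar\mu + \Delta$:

\[ \tilde f(S, \mu) = \sum_{T \supseteq S} \tilde f(T, \bar\mu)\, \Delta_{T \setminus S}. \qquad (3) \]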
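The translation identity (3) can be sanity-checked numerically. The sketch below assumes that the partially normalized coefficient of $S$ is the coefficient of $(x - \mu)_S$ in the expansion of $f$, which equals $\mathbb{E}_{x \sim \mathcal{D}_\mu}[f(x)(x-\mu)_S] / \prod_{i \in S}(1 - \mu_i^2)$; the function names and the numeric values are hypothetical. It compares the coefficients computed directly at $\mu = \bar\mu + \Delta$ with those obtained by translating the coefficients at $\bar\mu$.

```python
import itertools
import math

def prob(x, mu):
    """Probability of x under the product distribution with means mu."""
    return math.prod((1 + xi * mi) / 2 for xi, mi in zip(x, mu))

def tilde_coefficients(f, mu):
    """Coefficient of (x - mu)_S in f, for every subset S, by exact enumeration."""
    n = len(mu)
    cube = list(itertools.product((-1, 1), repeat=n))
    subsets = [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]
    out = {}
    for S in subsets:
        num = sum(prob(x, mu) * f(x) * math.prod(x[i] - mu[i] for i in S) for x in cube)
        out[S] = num / math.prod(1 - mu[i] ** 2 for i in S)
    return out

def f(x):
    # Toy +/-1-valued function on three bits (an arbitrary example).
    return 1 if (x[0] == 1 and x[1] == 1) or x[2] == 1 else -1

mu_bar = [0.2, -0.1, 0.3]      # unperturbed means (illustrative)
delta = [0.05, -0.02, 0.04]    # perturbation (illustrative)
mu = [a + b for a, b in zip(mu_bar, delta)]

base = tilde_coefficients(f, mu_bar)
direct = tilde_coefficients(f, mu)

# Right-hand side of (3): sum over T containing S of base[T] * Delta_{T \ S}.
translated = {
    S: sum(c * math.prod(delta[i] for i in set(T) - set(S))
           for T, c in base.items() if set(S) <= set(T))
    for S in base
}
max_gap = max(abs(direct[S] - translated[S]) for S in base)
```

The two computations agree up to floating-point error, reflecting the expansion $(x - \bar\mu)_T = \sum_{S \subseteq T} (x - \mu)_S \Delta_{T \setminus S}$.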



3.1 Decision trees 



A decision tree $T$ over $\{-1,1\}^n$ is a rooted binary tree, in which each internal node is labeled
with an integer $i \in N$, and each leaf is assigned a label of $\pm 1$. We consider Boolean decision
trees, in which case each internal node has exactly two children, and the two outgoing edges are
labeled, one of them 1 and the other $-1$. The tree computes a function $f_T : \{-1,1\}^n \to \{-1,1\}$
defined recursively as follows. If the root is a leaf, then the value is simply the value of the leaf.
Otherwise, say the root is labeled with $i$, and say its children are $T_{-1}$ and $T_1$, following the
labels $-1$ and $+1$, respectively. Then the value of the tree is defined to be the value computed by
$T_{x_i}$ on $x$, i.e., $f_{T_{x_i}}(x)$. In other words,

\[ f_T(x) = \frac{1 + x_i}{2}\, f_{T_1}(x) + \frac{1 - x_i}{2}\, f_{T_{-1}}(x). \]

We assume that no node label appears more than once on any path down from the root to a leaf.
Hence, the above function is a multilinear polynomial $f : \{-1,1\}^n \to \{-1,1\}$, but in some
cases it may be helpful to think of it as simply a multilinear polynomial $f : \mathbb{R}^n \to \mathbb{R}$. The size
of a decision tree is defined to be the number of leaves. We define the depth of the root of the
tree to be 0. Thus a depth-$d$ tree computes a degree-$d$ multilinear polynomial.
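The recursive definition translates directly into code. In the minimal sketch below, the nested-tuple encoding of a tree and the function names are assumptions made for illustration; `polynomial_value` evaluates the multilinear extension given by the displayed recursion, so it also accepts real-valued inputs.

```python
import itertools

# Assumed encoding: a tree is a leaf label (+1 or -1) or a tuple
# (i, child_minus, child_plus), where i is the index of the queried bit.

def evaluate(tree, x):
    """Follow the branch labeled x_i at each internal node until a leaf is reached."""
    while isinstance(tree, tuple):
        i, child_minus, child_plus = tree
        tree = child_plus if x[i] == 1 else child_minus
    return tree

def polynomial_value(tree, x):
    """Multilinear extension: (1 + x_i)/2 * f_plus(x) + (1 - x_i)/2 * f_minus(x)."""
    if not isinstance(tree, tuple):
        return float(tree)
    i, child_minus, child_plus = tree
    return ((1 + x[i]) / 2 * polynomial_value(child_plus, x)
            + (1 - x[i]) / 2 * polynomial_value(child_minus, x))

def size(tree):
    """Number of leaves."""
    if not isinstance(tree, tuple):
        return 1
    return size(tree[1]) + size(tree[2])

# Example: query x_0; if x_0 = -1 output x_1, if x_0 = +1 output -x_2.
tree = (0, (1, -1, 1), (2, 1, -1))

agree = all(evaluate(tree, x) == polynomial_value(tree, x)
            for x in itertools.product((-1, 1), repeat=3))
```

On the cube the two evaluations coincide; off the cube, `polynomial_value` gives the polynomial view of the same tree, for instance the value 0 at the origin for the example above.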



4 Fourier properties for random product distributions 

The following lemmas show that, with high probability, every coefficient $\hat f(S)$ that is
sufficiently large, say $|\hat f(S)| > b$, has all subterms $T \subseteq S$ satisfying $|\hat f(T)| > a$, for some
$a < b$. It turns out that this is easier to state in terms of the partially normalized coefficients
$\tilde f(S)$. The following simple lemma is at the heart of the analysis.

Lemma 2. Take any $c \in (0, 1/2)$, $\bar\mu \in [-1+c, 1-c]^n$ and let $\mu = \bar\mu + \Delta$, where $\Delta$ is chosen uniformly
at random from $[-c, c]^n$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be any multilinear function $f(x) = \sum_S \tilde f(S, \bar\mu)\,(x - \bar\mu)_S$.
Then for any $T \subseteq U \subseteq N$, $a, b > 0$,

\[ \Pr_{\Delta \in_u [-c,c]^n}\left[\, |\tilde f(T, \mu)| \le a \;\middle|\; |\tilde f(U, \mu)| \ge b \,\right] \le \sqrt{\frac{a}{b}}\, (4/c)^{|U \setminus T|/2}. \]

(For events $A$, $B$, we define $\Pr[A \mid B] = 0$ in the case that $\Pr[B] = 0$.) In order to prove Lemma 2,
we give a continuous variant of the Schwartz-Zippel theorem. This lemma states that a nonzero
degree-$d$ multilinear function cannot be too close to 0 too often over $x \in [-1, 1]^n$.

Lemma 3. Let $g : \mathbb{R}^n \to \mathbb{R}$ be a degree-$d$ multilinear polynomial, $g(x) = \sum_{|S| \le d} \hat g(S)\, x_S$. Suppose
that there exists $S \subseteq N$ with $|S| = d$ and $|\hat g(S)| \ge 1$. Then for a uniformly chosen random
$x \in [-1, 1]^n$, and for any $\epsilon > 0$, we have,

\[ \Pr_{x \in_u [-1,1]^n}[\, |g(x)| \le \epsilon \,] \le 2^d \sqrt{\epsilon}. \]

Proof. WLOG say $\hat g(D) = 1$ for $D = \{1, 2, \ldots, d\}$, for we can always permute the terms
and rescale the polynomial so that this coefficient is exactly 1. We first establish that,

\[ \Pr_{x \in_u [-1,1]^n}[\, |g(x)| \le \epsilon \,] \le \Pr_{x \in_u [-1,1]^n}[\, |x_D| \le \epsilon \,]. \qquad (4) \]

In other words, the worst case is a monomial. To see this, write,

\[ g(x) = x_1\, g_1(x_2, x_3, \ldots, x_n) + g_2(x_2, x_3, \ldots, x_n). \]

Now, by independence imagine picking $x$ by first picking $x_2, x_3, \ldots, x_n$ (later we will pick $x_1$).
Let $\gamma_i = g_i(x_2, \ldots, x_n)$ for $i = 1, 2$. Then, consider the two sets $I_1 = \{x_1 \in \mathbb{R} : |x_1\gamma_1 + \gamma_2| \le \epsilon\}$ and
$I_2 = \{x_1 \in \mathbb{R} : |x_1\gamma_1| \le \epsilon\}$. These are both intervals, and they are of equal width. However, $I_2$ is
centered at the origin. Hence, since $x_1$ is chosen uniformly from $[-1, 1]$, we have that for any
fixed $\gamma_1, \gamma_2$, $\Pr_{x_1 \in_u [-1,1]}[x_1 \in I_1] \le \Pr_{x_1 \in_u [-1,1]}[x_1 \in I_2]$, because $I_2 \cap [-1, 1]$ is at least as wide
as $I_1 \cap [-1, 1]$. Hence it suffices to prove the lemma for those functions where $\hat g(S) = 0$ for all $S$
for which $1 \notin S$. (In fact, this is the worst case.) By symmetry, it suffices to prove the lemma
for those functions where $\hat g(S) = 0$ for all $S$ for which $i \notin S$, for $i = 1, 2, \ldots, d$. After removing
all terms $S$ that do not contain $D$ we are left with the function $x_D$, establishing (4). Now, for
a loose bound, one can use Markov's inequality:

\[ \Pr[\, |x_D| \le \epsilon \,] = \Pr[\, |x_D|^{-1/2} \ge \epsilon^{-1/2} \,] \le \frac{\mathbb{E}[|x_D|^{-1/2}]}{\epsilon^{-1/2}} = \epsilon^{1/2}\, 2^d. \]

In the last step, $\mathbb{E}[|x_D|^{-1/2}] = \mathbb{E}[|x_1|^{-1/2}]^d$ by independence and symmetry, and a simple
calculation based on the fact that $|x_1|$ is uniform from $[0, 1]$ gives $\mathbb{E}[|x_1|^{-1/2}] = 2$. Although we won't
use it, we mention that one can compute a tight bound, $\Pr[|x_1 \cdots x_d| \le \epsilon] = \epsilon \sum_{i=0}^{d-1} \frac{1}{i!} \log^i \frac{1}{\epsilon}$. This
is shown by induction and $\Pr[|x_1 x_2 \cdots x_{i+1}| \le \epsilon] = \int_0^1 \Pr[|x_1 x_2 \cdots x_i| \le \epsilon/t]\, dt$. □
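For the worst-case monomial, the bound of Lemma 3 can be compared against simulation. The following sketch is an illustration under assumptions of my own choosing: the sample size, the seed, and the function names are hypothetical, and `exact_probability` uses the standard closed form for the product of $d$ independent uniform $[0,1]$ variables, valid for $0 < \epsilon \le 1$.

```python
import math
import random

def empirical_small_product(d, eps, trials, seed=0):
    """Fraction of trials with |x_1 * ... * x_d| <= eps, x_i uniform in [-1, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _trial in range(trials):
        product = 1.0
        for _coord in range(d):
            product *= rng.uniform(-1.0, 1.0)
        if abs(product) <= eps:
            hits += 1
    return hits / trials

def loose_bound(d, eps):
    """The Markov-inequality bound 2^d * sqrt(eps) from Lemma 3."""
    return 2 ** d * math.sqrt(eps)

def exact_probability(d, eps):
    """eps * sum_{i < d} ln(1/eps)^i / i!  (closed form, for 0 < eps <= 1)."""
    log_term = math.log(1 / eps)
    return eps * sum(log_term ** i / math.factorial(i) for i in range(d))

d, eps = 3, 0.01
estimate = empirical_small_product(d, eps, trials=100_000)
```

For these values the closed form is roughly 0.162 while the loose bound is 0.8, so the Markov argument is far from tight but has the right dependence on $\sqrt{\epsilon}$ for the purposes of the proof.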

With this lemma in hand, we are now ready to prove Lemma 2.

Proof of Lemma 2. For any set $S \subseteq N$, let $\Delta = (\Delta[S], \Delta[N \setminus S])$ where $\Delta[S] \in [-c, c]^{|S|}$
represents the coordinates of $\Delta$ that are in $S$. Let $V = U \setminus T$. The main idea is to imagine picking
$\Delta$ by picking $\Delta[N \setminus V]$ first (and later picking $\Delta[V]$). Now, we claim that once $\Delta[N \setminus V]$ is
fixed, $\tilde f(U, \mu)$ is determined. This follows from (3), using the fact that $S \setminus U \subseteq N \setminus V$:

\[ \tilde f(U, \mu) = \sum_{S \supseteq U} \tilde f(S, \bar\mu)\, \Delta_{S \setminus U}. \]

On the other hand, $\tilde f(T, \mu)$ is not determined only from $\Delta[N \setminus V]$. Once we have fixed $\Delta[N \setminus V]$,
it is now a polynomial in $\Delta[V]$, using (3) again:

\[ g(\Delta[V]) = \tilde f(T, \mu) = \sum_{S \supseteq T} \tilde f(S, \bar\mu)\, \Delta_{S \setminus T}. \]

Clearly $g$ is a multilinear polynomial of degree at most $|V|$. Most importantly, the coefficient of
$\Delta_V$ in $g$ is exactly $\sum_{S \supseteq T \cup V} \tilde f(S, \bar\mu)\, \Delta_{S \setminus (T \cup V)} = \tilde f(U, \mu)$, since $T \cup V = U$. Hence, $\tilde f(T, \mu)$
can be viewed as a degree-$|V|$ polynomial in the random variable $\Delta[V]$ with leading coefficient
$\tilde f(U, \mu)$, and we can apply Lemma 3. So, suppose that $|\tilde f(U, \mu)| \ge b$. Let $g'(x) = b^{-1} c^{-|V|} g(cx)$,
so the coefficient of $x_V$ in $g'$ has magnitude $(b^{-1} c^{-|V|})\, c^{|V|}\, |\tilde f(U, \mu)| \ge 1$. By Lemma 3,

\[ \Pr_{\Delta[V] \in_u [-c,c]^{|V|}}[\, |g(\Delta[V])| \le a \,] = \Pr_{x \in_u [-1,1]^{|V|}}[\, |g'(x)| \le a b^{-1} c^{-|V|} \,] \le \sqrt{\frac{a}{b}}\, (4/c)^{|V|/2}. \qquad □ \]
We now observe that Lemma 2 implies that with high probability, all sub-coefficients of large
$\hat f(S)$ will be pretty large.

Lemma 4. Let $f : \{-1,1\}^n \to [-1,1]$. Let $\alpha, \beta > 0$, $d \in \mathbb{N}$. Let $c \in (0, 1/2)$, $\bar\mu \in [-1+2c, 1-2c]^n$,
and $\mu = \bar\mu + \Delta$ where $\Delta \in [-c, c]^n$ is chosen uniformly at random. Then,

\[ \Pr_{\Delta \in_u [-c,c]^n}\left[\, \exists\, T \subseteq U \subseteq N \text{ such that } |U| \le d \,\wedge\, |\hat f(T, \mu)| \le \alpha \,\wedge\, |\hat f(U, \mu)| \ge \beta \,\right] \le \alpha^{1/2} \beta^{-5/2} (2/c)^{2d}. \]



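Proof. Since $\mu$ is $c$-bounded, for any $S \subseteq N$ with $|S| \le d$, $|\hat f(S, \mu)| \le |\tilde f(S, \mu)| \le c^{-d/2} |\hat f(S, \mu)|$ (see (2)).
It suffices to show that, for any $a, b > 0$,

\[ \Pr_{\Delta \in_u [-c,c]^n}\left[\, \exists\, T \subseteq U \subseteq N \text{ such that } |U| \le d \,\wedge\, |\tilde f(T, \mu)| \le a \,\wedge\, |\tilde f(U, \mu)| \ge b \,\right] \le a^{1/2} b^{-5/2}\, 4^d c^{-3d/2}. \]

This is because for $a = \alpha c^{-d/2}$ and $b = \beta$, $|\hat f(U, \mu)| \ge \beta$ implies $|\tilde f(U, \mu)| \ge b$, and $|\hat f(T, \mu)| \le \alpha$
implies $|\tilde f(T, \mu)| \le a$. We can bound the above quantity by the union bound using Lemma 2. It
is at most,

\[ \sum_{|U| \le d} \sum_{T \subseteq U} \Pr[\, |\tilde f(T, \mu)| \le a \wedge |\tilde f(U, \mu)| \ge b \,] = \sum_{|U| \le d} \sum_{T \subseteq U} \Pr[\, |\tilde f(T, \mu)| \le a \mid |\tilde f(U, \mu)| \ge b \,]\, \Pr[\, |\tilde f(U, \mu)| \ge b \,] \]
\[ \le \sum_{|U| \le d} \sum_{T \subseteq U} a^{1/2} b^{-1/2} (4/c)^{|U \setminus T|/2}\, \Pr[\, |\tilde f(U, \mu)| \ge b \,] \]
\[ \le 2^d a^{1/2} b^{-1/2} (4/c)^{d/2} \sum_{|U| \le d} \Pr[\, |\tilde f(U, \mu)| \ge b \,] \]
\[ = 2^d a^{1/2} b^{-1/2} (4/c)^{d/2}\; \mathbb{E}\big[\, |\{ U : |U| \le d \wedge |\tilde f(U, \mu)| \ge b \}| \,\big]. \]

All probabilities in the above are over $\Delta \in_u [-c, c]^n$. Finally, there can be at most $c^{-d} b^{-2}$
different $U \subseteq N$ with $|U| \le d$ such that $|\tilde f(U, \mu)| \ge b$, since $\sum_{|S| \le d} \tilde f^2(S, \mu) \le c^{-d} \sum_S \hat f^2(S, \mu) \le c^{-d}$ for all $\mu$ by
Parseval's inequality. Hence, the expected number of such $U$ is at most $c^{-d} b^{-2}$ and we have the
lemma. □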



5 Algorithm 

For simplicity, we suppose that the algorithm has exact knowledge of $\mu$. In general, these
parameters can be estimated to any desired inverse-polynomial accuracy in polynomial time.
The algorithm is below.
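Algorithm L.

Inputs: $(x^1, y^1), \ldots, (x^m, y^m) \in \mathbb{R}^n \times \{-1,1\}$ and $\mu \in [c, 1-c]^n$.

1. Let $z_i^j := (x_i^j - \mu_i)/\sqrt{1 - \mu_i^2}$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$, and write $z_S^j = \prod_{i \in S} z_i^j$.

2. Let $\mathcal{S}_0 := \{\emptyset\}$.

3. For $d = 1, 2, \ldots, \frac{\log m}{12}\,(1 - \max_i |\mu_i|)$:

   (a) Let
   \[ \mathcal{S}_d := \mathcal{S}_{d-1} \cup \left\{ S \cup \{i\} \;\middle|\; S \in \mathcal{S}_{d-1},\ i \in N,\ \left| \frac{1}{m} \sum_{j=1}^m y^j z^j_{S \cup \{i\}} \right| \ge m^{-1/3} \right\}. \]

   (b) If $|\mathcal{S}_d| > m$ then abort and output FAIL.

4. Let $p$ be the following polynomial $p : \{-1,1\}^n \to \mathbb{R}$, where $\mathcal{S}$ denotes the last set $\mathcal{S}_d$ built in step 3,
   \[ p(x) = \sum_{S \in \mathcal{S}} \left( \frac{1}{m} \sum_{j=1}^m y^j z^j_S \right) z_S(x, \mu). \]

5. Output $h(x) = \operatorname{sgn}(p(x))$.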
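The top-down search can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `algorithm_l_sketch`, the caller-supplied round limit `max_degree` (the text instead caps the rounds in terms of $m$ and $\max_i |\mu_i|$), the inclusion threshold $m^{-1/3}$ as read from step 3(a), and the toy target and sample size are all choices made here for the example.

```python
import itertools
import math
import random

def algorithm_l_sketch(xs, ys, mu, max_degree):
    """Grow a family of sets with large estimated coefficients, then threshold.

    max_degree is an assumption of this sketch (a caller-supplied round limit).
    Returns (h, kept) or None if the family grows beyond m sets.
    """
    m, n = len(xs), len(mu)
    scale = [math.sqrt(1 - mi ** 2) for mi in mu]
    # Step 1: normalized coordinates z_i^j.
    zs = [[(x[i] - mu[i]) / scale[i] for i in range(n)] for x in xs]

    def estimate(S):
        # Empirical coefficient (1/m) * sum_j y^j * z_S^j.
        return sum(y * math.prod(z[i] for i in S) for z, y in zip(zs, ys)) / m

    threshold = m ** (-1 / 3)
    kept = {frozenset(): estimate(frozenset())}    # step 2: start from the empty set
    tested = set(kept)
    for _round in range(max_degree):               # step 3
        added = {}
        for S in kept:
            for i in range(n):
                T = S | {i}
                if T in tested:                    # already kept or already rejected
                    continue
                tested.add(T)
                e = estimate(T)
                if abs(e) >= threshold:            # step 3(a)
                    added[T] = e
        kept.update(added)
        if len(kept) > m:                          # step 3(b)
            return None

    def h(x):
        # Steps 4 and 5: sign of the sparse polynomial built from the estimates.
        p = sum(e * math.prod((x[i] - mu[i]) / scale[i] for i in S)
                for S, e in kept.items())
        return 1 if p >= 0 else -1

    return h, kept

def target(x):
    # Toy size-4 tree: if x_0 = +1 output -x_2, otherwise output x_1.
    return -x[2] if x[0] == 1 else x[1]

rng = random.Random(0)
n, m = 5, 5000
mu = [rng.uniform(-0.05, 0.05) for _ in range(n)]
xs = [tuple(1 if rng.random() < (1 + mi) / 2 else -1 for mi in mu) for _ in range(m)]
ys = [target(x) for x in xs]

h, kept = algorithm_l_sketch(xs, ys, mu, max_degree=3)
errors = sum(h(x) != target(x) for x in itertools.product((-1, 1), repeat=n))
```

Rejected candidates are cached in `tested` because their estimates do not change between rounds, so re-testing them would give the same answer. On this toy target the degree-two sets $\{0,1\}$ and $\{0,2\}$ are reached through the heavy singletons $\{1\}$ and $\{2\}$, which is the top-down behavior the analysis relies on.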



It is well known that functions computed by decision trees can be approximated by sparse
polynomials, namely, by the set of "heavy" coefficients, i.e., those which have large magnitudes.
These heavy coefficients tend to be on terms of small degree as well. This is true for any
constant-bounded product distribution.

Lemma 5. Let $c \in [0, 1/2]$, let $\mu \in [-1+c, 1-c]^n$, $d \in \mathbb{N}$, $\beta > 0$, and let $f : \{-1,1\}^n \to \{-1,1\}$ be
computed by a size-$s$ decision tree. Then,

\[ \sum_{S :\, |\hat f(S,\mu)| \ge \beta \,\wedge\, |S| \le d} \hat f^2(S, \mu) \;\ge\; 1 - \left( 4(1 - c/2)^d s + 2^{d+2} \beta \right). \]

Hence, it is to be shown that algorithm L identifies these heavy coefficients and estimates 
them well. The proof of this lemma is deferred until after the proof of the main theorem. 

Proof of Theorem 1. First, note that for any $g : \{-1,1\}^n \to \mathbb{R}$ and any distribution $\mathcal{D}$ over
$\{-1,1\}^n$, $\Pr_{x \sim \mathcal{D}}[\operatorname{sgn}(g(x)) \ne f(x)] \le \mathbb{E}_{x \sim \mathcal{D}}[(g(x) - f(x))^2]$. The reason is that any time
$\operatorname{sgn}(g(x)) \ne f(x)$, we have that $|g(x) - f(x)| \ge 1$, since $f : \{-1,1\}^n \to \{-1,1\}$. Hence, it
suffices to show that with probability $\ge 1 - \delta$,

\[ \mathbb{E}_{x \sim \mathcal{D}_\mu}[(p(x) - f(x))^2] = \sum_S (\hat p(S, \mu) - \hat f(S, \mu))^2 \le \epsilon. \]

This is what we do. Define the estimate of $\hat f(S, \mu)$ (based on the data) to be,

\[ e(S) = \frac{1}{m} \sum_{j=1}^m y^j z_S^j. \]

By equation (1), we have that $\mathbb{E}[e(S)] = \hat f(S, \mu)$, for any fixed $S$, $\mu$, where the expectation is
taken over the $m$ data points. Of course, steps (3a) and (4) only evaluate $e(S)$ on a small
number of sets, but it is helpful to define $e$ for all $S$.

Let $d = \frac{2}{c} \log \frac{12 s}{\epsilon}$, $D = \frac{\log m}{12}(1 - \max_i |\mu_i|)$, $\beta = (\epsilon/(12 s))^{1 + 2/c}$, $t = m^{-1/3}$, and $\tau = t\sqrt{\epsilon}/4$. Note
that $D \ge \frac{\log m}{12}\, c \ge d$ for $m = \mathrm{poly}(s/\epsilon)$, so the algorithm will at least attempt to estimate all
coefficients up to degree $d$.
We define the set of gingerbread features to be,

\[ G = \{ S \subseteq N : |S| \le d \,\wedge\, |\hat f(S, \mu)| \ge \beta \}. \]

These are the features that we really require for a good approximation. We define the set of
breadcrumb features to be,

\[ B = \{ B \subseteq S \mid S \in G \}. \]

These are the features which will help us find the gingerbread features. The set of pebble features
is,

\[ P = \{\emptyset\} \cup \{ S \subseteq N : |S| \le D,\ |\hat f(S, \mu)| \ge t - \tau \}. \]

These are the features that might possibly be included in $\mathcal{S}_D$ on a "good" run of the algorithm.
Note that, by Parseval's inequality, $|P| \le 1 + (t - \tau)^{-2} \le 1 + 2t^{-2} \le 3t^{-2}$. We will argue that, with
high probability, $G \subseteq \mathcal{S}_D \subseteq P$. In order to do this, we also consider the set of candidate features,

\[ C = P \cup \{ S \cup \{i\} \mid S \in P,\ i \in N \}. \]

This is the set of all features that we might possibly estimate (evaluate $e(S)$) on a "good" run
of the algorithm. Let us formally call a run of the algorithm "good" if (a) $|\hat f(S, \mu) - e(S)| \le \tau$
for all $S \in C$ and (b) $|\hat f(S, \mu)| \ge t + \tau$ for all $S \in B$. First, we claim that (a) implies $\mathcal{S}_D \subseteq P$. This
can be seen by induction, arguing that $\mathcal{S}_i \subseteq P$ for all $i = 0, 1, \ldots, D$. This is trivial for $i = 0$. If
it holds for $i$, then for $i + 1$, we have that the set of features on iteration $i + 1$ that are estimated
will all be in $C$, hence will all be within $\tau$ of correct. Hence, for any of these features that is
not in $P$, we will have $|e(S)| < t$ and it will not be included in $\mathcal{S}_{i+1}$. Second, we claim that (a) and
(b) imply that $B \subseteq \mathcal{S}_D$. The proof of this is similarly straightforward by induction. So (a) and
(b) imply that $G \subseteq \mathcal{S}_D \subseteq P$, since $G \subseteq B$. Note that since $|P| \le 3t^{-2} \le m$, the algorithm will not
abort and output FAIL in this case. Now,

\[ \sum_S (\hat p(S, \mu) - \hat f(S, \mu))^2 \le \sum_{S \in \mathcal{S}_D} (e(S) - \hat f(S, \mu))^2 + \sum_{S \notin \mathcal{S}_D} \hat f^2(S, \mu) \le |P|\, \tau^2 + 4(1 - c/2)^d s + 2^{d+2} \beta. \]

This follows from $|\mathcal{S}_D| \le |P|$ and Lemma 5. Hence, a good run has,

\[ \sum_S (\hat p(S, \mu) - \hat f(S, \mu))^2 \le 3 t^{-2} \tau^2 + 4(1 - c/2)^d s + 2^{d+2} \beta \le \epsilon, \]

for the choice of parameters above, because $3 t^{-2} \tau^2 = (3/16)\epsilon$, $4(1 - c/2)^d s \le \epsilon/3$, and $2^{d+2} \beta \le \epsilon/3$.
This means that every good run outputs a hypothesis of error $\le \epsilon$. It remains to show that the
probability of a good run is at least $1 - \delta$, which we do by the union bound over the two events
(a) and (b). By Lemma 4, property (b) fails with probability at most,

\[ (t + \tau)^{1/2} \beta^{-5/2} (2/c)^{2d} \le 2 m^{-1/6} (12 s/\epsilon)^{c'} \le \delta/2, \]

for some constant $c'$ and $m = \mathrm{poly}(ns/(\delta\epsilon))$. Finally, it remains to show that (a) fails with
probability at most $\delta/2$. First, we need to bound $|z_S^j|$ for each $S \in C$. Let $v = 1 - \max_i |\mu_i| \in [c, 1]$,
so that $D = \frac{\log m}{12}\, v$. We first observe that $|z_i(x, \mu)| \le 2/\sqrt{1 - \mu_i^2} \le 2/v$ for any $i \in N$ and
$x \in \{-1,1\}^n$, by the definition of $z$. This means that $|z_S(x, \mu)| \le (2/v)^{\frac{\log m}{12} v} \le m^{1/12}$ for all
$S \in C$, $x \in \{-1,1\}^n$, using the fact that $(2/v)^v \le e$ for all $v \le 1$. Finally, by Chernoff-Hoeffding
bounds, the probability of $|e(S) - \hat f(S, \mu)| > \tau$ on any $S \in C$ is at most $2 e^{-m \tau^2 / (2 m^{1/6})}$. Since
$|C| \le n|P| \le 3 n t^{-2}$, it suffices to show that this is at most $\delta t^2/(6n) \le \delta/(2|C|)$. In other words,
to finish, we need that $2 e^{-m^{1/6} \epsilon/32} \le \delta m^{-2/3}/(6n)$, which is clearly true for $m$ sufficiently large;
in particular, $\mathrm{poly}(ns/(\delta\epsilon))$ certainly suffices. □
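We now prove Lemma 5.

Proof of Lemma 5. Let $g : \{-1,1\}^n \to \{-1,0,1\}$ be the function computed by the truncated
decision tree in which each internal node at depth $d$ has been replaced by a leaf of value 0.
Then,

\[ \sum_S (\hat f(S, \mu) - \hat g(S, \mu))^2 = \mathbb{E}_{x \sim \mathcal{D}_\mu}[(f(x) - g(x))^2] = \Pr_{x \sim \mathcal{D}_\mu}[f(x) \ne g(x)] \le (1 - c/2)^d s. \]

The last inequality follows from the fact that the probability of reaching any given node at depth $d$ is
at most $(1 - c/2)^d$, because each branch is followed with probability at most $(1 + |\mu_i|)/2 \le 1 - c/2$,
and there are at most $s$ such nodes. Since $g$ is degree $d$, $\sum_{|S| > d} \hat f^2(S, \mu) \le (1 - c/2)^d s$. Thus by
removing all terms of degree greater than $d$, we throw out at most $(1 - c/2)^d s$ mass. Hence, it suffices to show that,

\[ \sum_{S :\, |\hat f(S,\mu)| < \beta} \hat f^2(S, \mu) \le 3(1 - c/2)^d s + 2^{d+2} \beta. \]

This can be done by breaking it into two cases,

\[ \sum_{S :\, |\hat f(S,\mu)| < \beta} \hat f^2(S, \mu) = \sum_{S :\, |\hat f(S,\mu)| < \beta \,\wedge\, |\hat g(S,\mu)| \ge 2\beta} \hat f^2(S, \mu) + \sum_{S :\, |\hat f(S,\mu)| < \beta \,\wedge\, |\hat g(S,\mu)| < 2\beta} \hat f^2(S, \mu). \]

Each $S$ occurring in the first term above contributes at least $\beta^2$ to $\sum_S (\hat f(S, \mu) - \hat g(S, \mu))^2 \le
(1 - c/2)^d s$, hence there can be at most $(1 - c/2)^d s/\beta^2$ terms in the first term above, and

\[ \sum_{S :\, |\hat f(S,\mu)| < \beta \,\wedge\, |\hat g(S,\mu)| \ge 2\beta} \hat f^2(S, \mu) \le \beta^2\, \frac{(1 - c/2)^d s}{\beta^2} = (1 - c/2)^d s. \]

Using the fact that $(a + b)^2 \le 2(a^2 + b^2)$, for any reals $a, b$, we have,

\[ \sum_{S :\, |\hat f(S,\mu)| < \beta \,\wedge\, |\hat g(S,\mu)| < 2\beta} \hat f^2(S, \mu) \le \sum_{S :\, |\hat f(S,\mu)| < \beta \,\wedge\, |\hat g(S,\mu)| < 2\beta} 2\left( (\hat f(S, \mu) - \hat g(S, \mu))^2 + \hat g^2(S, \mu) \right). \]

Now we know that $\sum_S (\hat f(S, \mu) - \hat g(S, \mu))^2 \le (1 - c/2)^d s$, so this gives an upper bound of $2(1 - c/2)^d s$
on the sum of the first terms in the above. It suffices to show that,

\[ \sum_{S :\, |\hat g(S,\mu)| < 2\beta} \hat g^2(S, \mu) \le 2^{d+1} \beta. \]

To see this, note that $g$ has at most $4^d$ nonzero terms, as a depth-$d$ decision tree. And since
any vector $v \in \mathbb{R}^{4^d}$ with $\|v\|_2 \le 1$ has $\|v\|_1 \le 2^d$, we have that $\sum_S |\hat g(S, \mu)| \le 2^d$. Finally,

\[ \sum_{S :\, |\hat g(S,\mu)| < 2\beta} \hat g^2(S, \mu) \le 2\beta \sum_S |\hat g(S, \mu)| \le 2^{d+1} \beta. \qquad □ \]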

6 Conclusions 

In conclusion, we have shown in a precise sense, that all decision trees are learnable from most 
product distributions. The main tool we have is a type of generalization of KM that uses 
random examples drawn from a (perturbed) product distribution, and works only for terms 
of degree $O(\log n)$. Learning decision trees is a clear demonstration of the power of a new
model. However, the questions raised by such a tool are perhaps even more interesting. First, 
can one learn DNFs from most product distributions? Second, can one agnostically learn in 
these settings, for example can one agnostically learn decision trees in this setting? A third and 
very interesting direction would be to go beyond product distributions to arbitrary perturbed 
distributions. To be precise, let $\mathcal{D}$ be an arbitrary distribution on $\{-1,1\}^n$. Let $a, b \in [0, c]^n$ be
two uniformly random perturbation vectors. Consider the distribution in which $x$ is first chosen
from $\mathcal{D}$ and then each bit $x_i$ is altered as follows: if $x_i = 1$ then $x_i$ is flipped with probability $a_i$,
if $x_i = -1$ then $x_i$ is flipped with probability $b_i$. This gives a new type of perturbed distribution
on inputs which is not in general a product distribution. Hence, our current techniques will not 
work but it is possible that others will. 

Finally, we mention that the Goldreich-Levin algorithm 0], similar to KM, has a number 
of applications in computational complexity and other areas. It would be interesting to see if 
these applications could also be studied from random examples, instead of black-box access, in 
a smoothed analysis setting. 

Acknowledgments. We are very grateful to Ran Raz, Ryan O'Donnell, and Prasad Tetali for 
illuminating discussions. 